mxhbl (Collaborator) commented Sep 1, 2025

This PR adds a more robust interface for the ghash function that lets the user select the hashing algorithm manually. This is necessary because Julia's Base.hash is not sufficiently collision-resistant: I recently ran into a collision while working with a set of about 80000 graphs (see also the discussion in #27 and the link therein).
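For a sense of scale, the following is a rough birthday-bound estimate, assuming an idealized hash whose outputs are uniformly distributed (an assumption, not a property measured for any of the algorithms in this PR). The function name is made up for this illustration. Under that assumption, a collision among 80000 items with a 64-bit hash has a probability on the order of 1e-10, so an observed collision points to a weakness in the hash function rather than bad luck.

```julia
# Birthday-bound approximation for an ideal hash with `bits` output bits:
#   p ≈ 1 - exp(-n(n-1)/2 / 2^bits)
# `collision_probability` is a hypothetical helper, not part of the package.
function collision_probability(n::Integer, bits::Integer)
    pairs = n * (n - 1) / 2            # number of distinct pairs of items
    return -expm1(-pairs / 2.0^bits)   # 1 - exp(-x), accurate for tiny x
end

p64  = collision_probability(80_000, 64)    # ideal 64-bit hash
p128 = collision_probability(80_000, 128)   # ideal 128-bit hash
println("64-bit: ", p64, "   128-bit: ", p128)
```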

Main changes:

  • It is now possible to choose from five different hashing algorithms:
    • xxHash, in 64-bit and 128-bit variants. Fast and reasonably collision-resistant (it is not a cryptographic hash). This is now the default.
    • The Julia SHA library, in 64-bit and 128-bit variants. This was already used for large graphs, but can now be chosen through the interface. It is fairly slow, but the most secure option.
    • The Julia Base.hash is still available, but not recommended.
  • To select the hashing algorithm, there is a simple struct-based interface. For example, to use xxHash with 64 bits, use ghash(g; alg=XXHash64Alg()). Algorithm choice is explained in the docstring to ghash.
  • Since the hash values depend on the chosen algorithm, graph hashes are now cached together with the algorithm that was used to compute them. For this there is a simple HashCache type that holds 64-bit and 128-bit hashes together with their algorithms. For now this is internal only, but something like it could also be exported as a convenience/safety layer that checks whether the hash algorithms match before comparing hashes.
  • Hashes of identical graphs that are represented with different bitpacking types are no longer the same. This could be changed, but computing hashes in a bitpacking-agnostic way is more expensive. Since changing the bitpacking is not really part of the public interface, and since the bitpacking type should be UInt64 in 99.9% of cases anyway, I think this is not a problem. It may become a problem in the future, though, if we want to compare hashes between a DenseNautyGraph and a SparseNautyGraph.
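The struct-based selection and the idea of storing a hash together with its algorithm can be sketched as follows. This is a minimal illustration and not the package's implementation: every name here (AbstractToyAlg, ToyBaseAlg, ToySHA64Alg, toy_ghash, ToyTaggedHash, same_hash) is hypothetical, the "graph" is just a vector of packed words, xxHash is left out so that the sketch only needs the SHA standard library, and dispatch is used for brevity where the PR uses type checks.

```julia
using SHA  # Julia standard library

# Hypothetical algorithm tags for this sketch (not the package's types).
abstract type AbstractToyAlg end
struct ToyBaseAlg  <: AbstractToyAlg end   # wraps Base.hash
struct ToySHA64Alg <: AbstractToyAlg end   # SHA-256 truncated to 64 bits

# Raw bytes of the packed words; the result depends on the word type.
raw_bytes(words::Vector{<:Unsigned}) = collect(reinterpret(UInt8, words))

toy_hash(words, ::ToyBaseAlg) = UInt64(hash(raw_bytes(words)))
function toy_hash(words, ::ToySHA64Alg)
    digest = sha256(raw_bytes(words))            # 32-byte Vector{UInt8}
    return reinterpret(UInt64, digest[1:8])[1]   # keep the first 8 bytes
end

# Keyword interface in the spirit of ghash(g; alg=...).
toy_ghash(words; alg::AbstractToyAlg = ToySHA64Alg()) = toy_hash(words, alg)

# A hash value tagged with the algorithm that produced it.
struct ToyTaggedHash{A<:AbstractToyAlg}
    alg::A
    value::UInt64
end
tagged_hash(words; alg::AbstractToyAlg = ToySHA64Alg()) =
    ToyTaggedHash(alg, toy_ghash(words; alg = alg))

# Refuse to compare hashes that were computed with different algorithms.
function same_hash(a::ToyTaggedHash, b::ToyTaggedHash)
    typeof(a.alg) === typeof(b.alg) ||
        throw(ArgumentError("hashes were computed with different algorithms"))
    return a.value == b.value
end
```

In this sketch, toy_ghash(UInt64[1]) and toy_ghash(UInt32[1]) hash different byte sequences, which mirrors the bitpacking dependence described above, and same_hash throws instead of silently comparing values from different algorithms.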

Notes:

  • Even though using xxHash requires a few more allocations, it is generally faster than the Base Julia hash.
  • There is a Julia wrapper for the xxHash library, but it is quite a heavy dependency (it depends on CBinding.jl, which takes quite long to precompile on my machine). Since I only need a tiny subset of the functionality of xxHash, I depend on xxHash_jll directly and call the C interface myself.
  • In the new implementation of ghash I am not using multiple dispatch to select the hash algorithm, but instead I use type checks on the algorithm structs, which should compile away. @Krastanov: I guess this should be equivalent, but do you think it is better/safer to use multiple dispatch?
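To make the question about type checks versus multiple dispatch concrete, here is a small sketch of both styles. The algorithm types and function names are invented for this comparison, and Base.hash with two different seeds stands in for real hash algorithms. For the supported algorithms both variants return the same values. The practical differences are that the branching variant must handle unknown algorithms explicitly, while the dispatch variant raises a MethodError and can be extended by adding methods.

```julia
# Hypothetical algorithm tags, used only for this comparison.
abstract type AbstractDemoAlg end
struct DemoAlgA <: AbstractDemoAlg end
struct DemoAlgB <: AbstractDemoAlg end
struct DemoAlgUnsupported <: AbstractDemoAlg end

# Variant 1: branch on the algorithm type. When the concrete type of `alg`
# is known at compile time, these checks can typically be resolved statically.
function digest_branch(data::Vector{UInt8}, alg::AbstractDemoAlg)
    if alg isa DemoAlgA
        return hash(data)
    elseif alg isa DemoAlgB
        return hash(data, UInt(0x1234))
    else
        throw(ArgumentError("unsupported algorithm $(typeof(alg))"))
    end
end

# Variant 2: one method per algorithm, selected by multiple dispatch.
digest_dispatch(data::Vector{UInt8}, ::DemoAlgA) = hash(data)
digest_dispatch(data::Vector{UInt8}, ::DemoAlgB) = hash(data, UInt(0x1234))

data = UInt8[0x01, 0x02, 0x03]
println(digest_branch(data, DemoAlgA()) == digest_dispatch(data, DemoAlgA()))
```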

mxhbl (Collaborator, Author) commented Sep 11, 2025

Thinking about this a bit more, the tiny performance gain from caching hashes is not worth it. This last change removes hash caching and instead stores an iscanon flag for every graph, which records whether the graph is in canonical form. If it is canonical, we do not need to call nauty before hashing or isomorphism checking. The user can also query this via the new iscanon(::AbstractNautyGraph) function.
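The effect of such a flag can be illustrated with a toy model. Everything in this sketch is hypothetical: DemoGraph is not the package's graph type, and sorting a word vector merely stands in for the canonical labeling that nauty would compute. The point is only that the expensive canonization step is skipped when the flag is already set.

```julia
# Toy stand-in for a graph: packed words plus a canonical-form flag.
mutable struct DemoGraph
    words::Vector{UInt64}
    iscanon::Bool
end

demo_iscanon(g::DemoGraph) = g.iscanon

# Stand-in for canonization; a real implementation would call nauty here.
function demo_canonize!(g::DemoGraph)
    g.iscanon && return g   # already canonical: nothing to do
    sort!(g.words)          # placeholder for the canonical labeling
    g.iscanon = true
    return g
end

# Hash the canonical form, computing it only when the flag is not set.
demo_hash(g::DemoGraph) = hash(g.iscanon ? g.words : sort(g.words))

g1 = DemoGraph(UInt64[3, 1, 2], false)
g2 = DemoGraph(UInt64[1, 2, 3], true)
println(demo_hash(g1) == demo_hash(g2))
```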
